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Abstract 

We propose a nonparametric generalization of 
belief propagation. Kernel Belief Propagation 
(KBP), for pairwise Markov random fields. Mes- 
sages are represented as functions in a repro- 
ducing kernel Hilbert space (RKHS), and mes- 
sage updates are simple linear operations in the 
RKHS. KBP makes none of the assumptions 
commonly required in classical BP algorithms: 
the variables need not arise from a finite do- 
main or a Gaussian distribution, nor must their 
relations take any particular parametric form. 
Rather, the relations between variables are rep- 
resented implicitly, and are learned nonparamet- 
rically from training data. KBP has the advan- 
tage that it may be used on any domain where 
kernels are defined (M"^, strings, groups), even 
where explicit parametric models are not known, 
or closed form expressions for the BP updates 
do not exist. The computational cost of mes- 
sage updates in KBP is polynomial in the train- 
ing data size. We also propose a constant time 
approximate message update procedure by rep- 
resenting messages using a small number of ba- 
sis functions. In experiments, we apply KBP to 
image denoising, depth prediction from still im- 
ages, and protein configuration prediction: KBP 
is faster than competing classical and nonpara- 
metric approaches (by orders of magnitude, in 
some cases), while providing significantly more 
accurate results. 

1 Introduction 

Belief propagation is an inference algorithm for graphical 
models that has been widely and successfully applied in a 
great variety of domains, including vision ( Sudderth et al.| 



|2003 ), protein folding ( Yanover & Weiss 2002 ), and turbo 



decoding (McEhece et al. 19981. In these applications. 



the variables are usually assumed either to be finite dimen- 
sional, or in continuous cases, to have a Gaussian distri- 
bution ( [Weiss & Freeman"] |200 1 | l. In many applications of 
graphical models, however, the variables of interest are nat- 
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urally specified by continuous, non-Gaussian distributions. 
For example, in constructing depth maps from 2D images, 
the depth is both continuous valued and has a multimodal 
distribution. Likewise, in protein folding, angles are mod- 
eled as continuous valued random variables, and are pre- 
dicted from amino acid sequences. In general, multimodal- 
ities, skewness, and other non-Gaussian statistical features 
are present in a great many real-world problems. The corre- 
sponding inference procedures for parametric models typ- 
ically involve integrals for which no closed form solutions 
exist, and are without computationally tractable exact mes- 
sage updates. Worse still, parametric models for the rela- 
tions between the variables may not even be known, or may 
be prohibitively complex. 

Our first contribution in this paper is a novel generalization 
of belief propagation for pairwise Markov random fields. 
Kernel BP, based on a reproducing kernel Hilbert space 
(RKHS) representation of the relations between random 
variables. This extends earlier work of Song et al. (20101 
on inference for trees to the case of graphs with loops. The 
algorithm consists of two parts, both nonparametric: first, 
we learn RKHS representations of the relations between 
variables directly from training data, which removes the 
need for an explicit parametric model. Second, we pro- 
pose a belief propagation algorithm for inference based 
on these learned relations, where each update is a linear 
operation in the RKHS (although the relations themselves 
may be highly nonlinear in the original space of the vari- 
ables). Our approach applies not only to continuous-valued 
non-Gaussian variables, but also generalizes to strings and 
graphs ( [Scholkopf et al .' 2004), groups (Fukumizu et al. 



20091 ), compact manifolds (Wendland 2005 Chapter 17), 
and other domains on which kernels may be defined. 

A number of alternative approaches have been developed to 
perform inference in the continuous-valued non-Gaussian 
setting. |Sudderth et al. ( 2003) proposed an approximate 



belief propagation algorithm for pairwise Markov random 
fields, where the parametric forms of the node and edge 
potentials are supplied in advance, and the messages are 
approximated as mixtures of Gaussians: we refer to this 
approach as Gaussian Mixture BP (this method was in- 
troduced as "nonparametric BP", but it is in fact a Gaus- 
sian mixture approach). Instead of mixtures of Gaussians, 
Ihler & McAllester ( 2009 | l used particles to approximate 
the messages, resulting in the Particle BP algorithm. Both 
Gaussian mixture BP and particle BP assume the potentials 
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to be pre-specified by the user: the methods described are 
purely approximate message update procedures, and do not 
learn the model from training data. By contrast, kernel BP 
learns the model, is computationally tractable even before 
approximations are made, and leads to an entirely differ- 
ent message update formula than the Gaussian Mixture and 
Particle representations. 

A direct implementation of kernel BP has a reason- 
able computational cost: each message update costs 
0(r7i^dniax) when computed exactly, where m is the num- 
ber of training examples and dmax is the maximum degree 
of a node in the graphical model. For massive data sets 
and numbers of nodes, as occur in image processing, this 
cost might still be expensive. Our second contribution is 
a novel constant time approximate message update proce- 
dure, where we express the messages in terms of a small 
number ^ <C m, of representative RKHS basis functions 
learned from training data. Following an initialization cost 
linear in m, the cost per message update is decreased to 
0(£^ciinax)i independent of the number of training points 
TO. Even without these approximate constant time updates, 
kernel BP is substantially faster than Gaussian mixture BP 
and particle BP. Indeed, an exact implementation of Gaus- 
sian mixture BP would have an exponentially increasing 
computational and storage cost with number of iterations. 
In practice, both Gaussian mixture and particle BP require 
a Monte Carlo resampling procedure at every node of the 
graphical model. 

Our third contribution is a thorough evaluation of kernel BP 
against other nonparametric BP approaches. We apply both 
kernel BP and competing approaches to an image denoising 
problem, depth prediction from still images, protein con- 
figuration prediction, and paper topic inference from ci- 
tation networks: these are all large-scale problems, with 
continuous-valued or structured random variables having 
complex underlying probability distributions. In all cases, 
kernel BP performs outstandingly, being orders of magni- 
tude faster than both Gaussian mixture BP and particle BP, 
and returning more accurate results. 

2 Markov Random Fields And Belief 
Propagation 

We begin with a short introduction to pairwise Markov ran- 
dom fields (MRFs) and the belief propagation algorithm. 
A pairwise Markov random field (MRF) is defined on an 
undirected graph Q := (V, £) with nodes V := {1, . . . ,n} 
connected by edges in £. Each node s e V is associated 
with a random variable Xg on the domainA" (we assume 
a common domain for ease of notation, but in practice 
the domains can be different), and Tg :~ {t\{s,t) £ £} 
is the set of neighbors of node s with size dg := Ir^j. 
In a pairwise MRF, the joint distribution of the variables 
X :— {Xi, . . . , ^|v|} is assumed to factorize according to 
a model P(X) = ^ ll(s,t)^£ ^st{Xg,Xt) H.^v ^^(^«)' 



where 5's(Xs) and "ifstiXsjXt) are node and edge poten- 
tials respectively, and Z is the partition function that nor- 
malizes the distribution. 

The inference problem in an MRF is defined as calculat- 
ing the marginals P(Xs) for nodes s e V and F{Xs,Xt) 
for edges (s,t) G £. The marginal P{Xs) not only pro- 
vides a measure of uncertainty of Xg, but also leads to a 
point estimate x* := argmaxP(Xs). Belief Propagation 
(BP) is an iterative algorithm for performing inference in 
MRFs ([Pearl 1988 ). BP represents intermediate results of 
marginalization steps as messages passed between adjacent 
nodes: a message mts from i to s is calculated based on 
messages m„t from all neighboring nodes u of t besides 
s, i.e.. 



{Xs)= f ^st{Xs,Xt)<it{Xt)Y[mut{Xt)dXt. (1) 



Note that we use Y[u\s denote Y[ 



luert\s 



where it is un- 



derstood that the indices range over all neighbors u of t ex- 
cept s. This notation also applies to operations other than 
the product. The update in ([T]) is iterated across all nodes 
until a fixed point, to^^, for all messages is reached. The re- 
sulting node beliefs (estimates of node marginals) are given 
by B(X,)cxvI/,(X,) liter, 

For acyclic or tree-structured graphs, BP results in node 
beliefs M{Xs) that converge to the node marginals V{Xs)- 
This is generally not true for graphs with cycles. In many 
applications, however, the resulting loopy BP algorithm 



exhibits excellent empirical performance (Murphy et al. 



1999J). Several theoretical studies have also provided in- 



sight into the approximations made by loopy BP, partially 



justifying its application to graphs with cycles ( Wainwright 
|& Jordan|[200^|Yedidia et al.||200T] ). 

The learning problem in MRFs is to estimate the node and 
edge potentials, which is often done by maximizing the ex- 
pected log-likelihood Ex~p*(x)[logP(^)] of the model 
¥{X) with respect to the true distribution F*{X). The re- 
sulting optimization problem usually requires solving a se- 



quence of inference problems as an inner loop (Koller & 
Friedman"} |2009[ l ; BP is often deployed for this purpose. 



3 Properties of Belief Propagation 

Our goal is to develop a nonparametric belief propagation 
algorithm, where the potentials are nonparametric func- 
tions learned from data, such that multimodal and other 
non-Gaussian statistical features can be captured. Most 
crucially, these potentials must be represented in such a 
way that the message update in ([T]l is computationally 
tractable. Before we go into the details of our kernel BP 
algorithm, we will first explain a key property of BP, which 
relates message updates to conditional expectations. When 
the messages are RKHS functions, these expectations can 
be evaluated efficiently. 
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Yedidia et al. ( 2001) ) showed BP to be an iterative algorithm 
for minimizing the Bethe free energy, which is a variational 
approximation to the log-partition function, \ogZ, in the 
MRF model P(X). The beliefs are fixed points of BP al- 
gorithm if and only if they are zero gradient points of the 
Bethe free energy. In Section 5 of the Appendix, we show 
maximum likelihood learning of MRFs using BP results in 
the following equality, which relates the conditional of the 
true distribution, the learned potentials, and the fixed point 
messages, 

F*{Xt\Xs) = ^ — (2) 



where P*(Xs) and m^g{Xs) are assumed strictly positive. 
[Wainwright et al.| ( |2003| Section 4) derived a similar rela- 
tion, but for discrete variables under the exponential fam- 
ily setting. By contrast, we do not assume an exponential 
family model, and our reasoning applies to continuous vari- 
ables. A further distinction is that Wainwright et al. spec- 
ify the node potential 5's(X<;) = ¥*{Xs) and edge poten- 
tial ^iX,,Xt) = V*{Xs,Xt)¥*{Xs)-^V*{Xt)-\ which 
represent just one possible choice among many that satis- 
fies (|2|. Indeed, we next show that in order to run BP for 
subsequent inference, we do not need to commit to a par- 
ticular choice for 'i's{Xs) and 'i>{Xs, Xt), nor do we need 
to optimize to learn *s(Xs) and '^{Xs,Xt). 

We start by dividing both sides of ([Tji by m^^(Xs), and 
introducing 1 = H 



s{Xs 



mUXs 



^st{x,,Xt)^t{Xt)Uu\s^*utiXt) 



X 



^UXs) 



We next substitute the BP fixed point relation (j2]) into ([3]l, 



six,)- 



to 



and reparametrize the messages nitsiXs) ^ 
obtain the following property for BP updates (see Section 
6 in the Appendix for details): 

Property I If we learn an MRF using BP and subse- 
quently use the learned potentials for inference, BP updates 
can be viewed as conditional expectations. 



mts{Xs 



X 



^\Xt\X,)T\ m,,t{Xt)dXt 



E 



Xt\X, 



n 



u\s 



■u\s 

mut{Xt) 



(4) 



Using similar reasoning, the node beliefs on convergence 
of BP take the form M{X,) cx V*{X,) Uter, rn*^iXs). In 
the absence of external evidence, a fixed point occurs at the 
true node marginals, i.e., M{Xs) oc ¥*{Xs) for all s £ V. 
Typically there can be many evidence variables, and the 
belief is then an estimate of the true conditional distribution 
given the evidence. 

The above property of BP immediately suggests that if be- 
lief propagation is the inference algorithm of choice, then 
MRFs can be learned very simply: given training data 
drawn from ¥*{X), the empirical conditionals P{Xt\Xs) 



are estimated (either in parametric form, or nonparametri- 
cally), and the conditional expectations are evaluated us- 
ing these estimates. Evidence can also be incorporated 
straightforwardly: if an observation Xf is made at node t, 
the message from t to its neighbor s is simply the empirical 
likelihood function mts{Xs) oc P{xt\Xs), where we use 
lowercase to denote observed variables with fixed values, 
and capitalize unobserved random variables. 

With respect to kernel belief propagation, our key insight 
from Property [T] however, is that we need not explicitly 
recover the empirical conditionals V{Xt\Xs) as an inter- 
mediate step, as long as we can compute the conditional 
expectation directly. We will pursue this approach next. 

4 Kernel Belief Propagation 

We develop a novel kernelization of belief propagation, 
based on Hilbert space embeddings of conditional distri- 



butions (Song et al. 2009 1, which generalizes an earlier 



kernel algorithm for exact inference on trees ( Song et al. 



20101. As might be expected, the kernel implementation 
of the BP updates in Q is nearly identical to the ear- 
lier tree algorithm, the main difference being that we now 
consider graphs with loops, and iterate until convergence 
(rather than obtaining an exact solution in a single pass). 
This difference turns out to have major implications for the 
implementation: the earlier solution of Song et al. is poly- 
nomial in the sample size, which was not an issue for the 
the smaller trees considered by Song et al., but becomes 
expensive for the large, loopy graphical models we address 
in our experiments. We defer the issue of efficient imple- 
mentation to Section [5] where we present a novel approxi- 
mation strategy for kernel BP which achieves constant time 
message updates. 

In the present section, we will provide a detailed deriva- 
tion of kernel BP in accordance with [Song et al. (2010 1. 
While the immediate purpose is to make the paper self- 
contained, there are two further important reasons: to pro- 
vide the background necessary in understanding our effi- 
cient kernel BP updates in Section |5] and to demonstrate 
how kernel BP differs from the competing Gaussian mix- 
ture and particle based BP approaches in Section|6] (which 
was not addressed in earlier work on kernel tree graphical 
models). 

4.1 Message Representations 

We begin with a description of the properties of a message 
mut{xt), given it is in the reproducing kernel Hilbert space 
(RKHS) T of functions on the separable metric space X 



( |Aronszajn|[T950|[Sch6lkopf & Smola||2002) . As we will 
see, the advantage of this assumption is that the update pro- 
cedure can be expressed as a linear operation in the RKHS, 
and results in new messages that are likewise RKHS func- 
tions. The RKHS J- is defined in terms of a unique pos- 
itive definite kernel k{xsTx'^) with the reproducing prop- 
erty {mts{-),k{xs,-))jr = mtsixs), where k{xs,-) indi- 
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cates that one argument of the kernel is fixed at Xs- Thus, 
we can view the evaluation of message rats at any point 
Xg € A" as a linear operation in F: we call k{xs, •) 
the representer of evaluation at Xg, and use the shorthand 
k{xs,-) = (l){xs). Note that fc(2;s,2;^) = 0(x'J)_^; 
the kernel encodes the degree of similarity between Xg and 
x'g. The restriction of messages to RKHS functions need 
not be onerous; on compact domains, universal kernels 



(in the sense of Steinwart 2001 1 are dense in the space of 
bounded continuous functions (e.g., the Gaussian RBF ker- 
nel k{xs,x'g) = exp(— (7 \\xs — x'J\'^) is universal). Ker- 
nels may be defined when dealing with random variables 
on additional domains, such as strings, graphs, and groups. 

4.2 Kernel BP Message Updates 

We next define a representation for message updates, un- 
der the assumption that messages are RKHS functions. 
For simplicity, we first establish a result for a three node 
chain, where the middle node t incorporates an incom- 
ing message from u, and then generates an outgoing mes- 
sage to s (we will deal with multiple incoming messages 
later). In this case, the outgoing message mts{xs) eval- 
uated at Xs simplifies to mts{xs) = Exda; 
Under some regularity conditions for the integral, we can 
rewrite message updates as inner products, rats{xs) — 



E 



Xt\xA{^ut,(l){Xt))j:] 



{ 



m.at,Kxt\x, 



ing the reproducing property of the RKHS. We refer to 
P'Xt\xs '■= ^Xt\xs['t'{Xt)] E T as the feature space em- 
bedding of the conditional distribution P{Xt\xs)- If we 
can estimate this quantity directly from data, we can per- 
form message updates via a simple inner product, avoiding 
a two-step procedure where the conditional distribution is 
first estimated and the expectation then taken. 

An expression for the conditional distribution embedding 
was proposed by 'S ong et al.| ( [2009 '). We describe this ex- 
pression by analogy with the conditioning operation for a 
Gaussian random vector z ~ A/^(0, C), where we partition 

z = {zi ,zj)'^ such that zi e and Z2 € R"^' . Given the 
covariances Cn := E[ziZ]'^] and C12 := E[zizJ], we can 
write the conditional expectation E[Zi|z2] — (^12(^22^ -^2- 
We now generalize this notion to RKHSs. Followingl Fuku-| 
[mizu et al. (2004) , we define the covariance operator Cx,Xt 
which allows us to compute the expectation of the prod- 
uct of function /(X,) and g{Xt), i.e. Ex^x, [f{Xs)g{Xt)], 
using linear operation in the RKHS. More formally, let 
Cx^Xt : J- ^ J- such that for all f,g,hE T, 
Ex^xA!{Xs)g(Xt)\ = (/, Ex,x, \<\>{Xs)®mt)\9)T 
= (/, (5) 
where we use tensor notation (/ ® g)h — f {g, h)jr . This 
can be understood by analogy with the finite dimensional 
case: if x,y, z £ M"^, then {xy^)z — x{y^ z); furthermore, 
{x'^x'){y'^y'){z'^ z') = (g) y (g) z, (g) (g) z')jj,rf3 
given x,y, z,x' ,y' , z' e M''. Based on covariance op- 
erators. Song et al. define a conditional embedding op- 
erator which allow us to compute conditional expecta- 



tions Ex^\x^ [f{Xt)] as linear operations in the RKHS. Let 



Xt\X, 



C 



XtX, 



C^y^ such that for all / G J", 



Ex.|.J/(^t)] = (/, ^x^xX<t>{Xt))^ = (/, Mx.ix.)^ 
= (/, (6) 
Although we used the intuition from the Gaussian case in 
understanding this formula, it is important to note that the 
conditional embedding operator allows us to compute the 
conditional expectation of any f E regardless of the 
distribution of the random variable in feature space (aside 
from the condition that h{xs) := Ext\xs [f{Xt)] is in the 
RKHS on Xg, as noted by Song et al.). In particular, we 
do not need to assume the random variables have a Gaus- 
sian distribution in feature space (the definition of feature 
space Gaussian BP remains a challenging open problem: 
see Appendix, Section 7). 

We can thus express the message update as a linear opera- 
tion in the feature space, 

nitsixs) = {rriut, Uxt\x,<Pixs)) jr ■ 
For multiple incoming messages, the message updates fol- 
low the same reasoning as in the single message case, albeit 



with some additional notational complexity (see also Song 
|et al.j |2010[ ). We begin by defining a tensor product repro 
ducing kernel Hilbert space H :— under which 

the product of incoming messages can be written as a sin- 
gle inner product. For a node t with degree dt — \Tt\, the 
product of incoming messages m„t from aU neighbors ex- 
cept s becomes an inner product in T-L, 



where £,{Xt) 
becomes 

■mts{xs) 



u\s 



u\. 



mut 



^{Xt) 



(7) 

^u\s 't'i-^t)- The message update (|4|) 



E 



Xt\x 



mt)] 



(8) 



By analogy with (j6]l, we can define the conditional embed- 
ding operator for the tensor product of features, such that 



U 



> T'^ satisfies 
:= E, 



l^xf\x, — {i{Xt)\ ^U^fy^M^'^y 

As in the single variable case, U^n,,^ is defined in terms of 



a covariance operator C 



xfx. 



Ex,xAi{Xt)®^{Xs)] 
in the tensor space, and the operator Cx^x^- The operator 
^x® \x takes the feature map (f){xg) of the point on which 
we condition, and outputs the conditional expectation of 
the tensor product feature £,{Xt). Consequently, we can 
express the message update as a linear operation, but in a 
tensor product feature space. 



mtsixs) 



u\s 



u 



xf\x, 



(f){Xs 



(10) 



The belief at a specific node s can be computed as 
'^*iXs) Yiuer, rUusiXs) where the true marginal ¥*{Xr) 
can be estimated using Parzen windows. If this is unde- 
sirable (for instance, on domains where density estimation 
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cannot be performed), the belief can instead be expressed 
as a conditional embedding operator ( |Song et aL 2010 1. 

4.3 Learning Kernel Graphical Models 

Given a training sample of m pairs a;*)}™ 
drawn i.i.d. from P*(Xf, Xg), we can represents messages 
and their updates based purely on these training examples. 
We first define feature matrices $ — {(j){xl), . . . , 0(a;"')), 
T = . . . , HxT)) and a>« = . . . , e(xr)), 

and corresponding kernel matrices K = and L = 

T^T. The assumption that messages are RKHS functions 
means that messages can be represented as linear combina- 
tions of the training features $, i.e., m„t = ^f3ut, where 
/3„t e M™. On this basis, |Song et al.| ( |2009l ) propose a 
direct regularized estimate of the conditional embedding 
operators from the data. This approach avoids explicit con- 
ditional density estimation, and directly provides the tools 



The computational complexity of the finite sample BP up- 



needed for computing the RKHS message updates in ( 10 1. 
Following this approach, we first estimate the covariance 



operators Cx,x^ = ;^*T ' , C 



xfx. 



-$«^T' and 



CxaXs — ^"""T'"^' ™d obtain an empirical estimate of the 
conditional embedding operator. 



+\mI)-'T 



(11) 



where A is a regularization parameter. Note that we need 
not compute the feature space covariance operators explic- 
itly: as we will see, all steps in kernel BP are carried out 
via operations on kernel matrices. 

We now apply the empirical conditional embedding op- 
erator to obtain a finite sample message update for ( [T0| . 
Since the incoming messages m„t can be expressed as 
rhut — ^(3ut^ the outgoing message fhts at Xg is 



^ $«(L + Am/)-^T>(x,) 

_^ I H 

= (0„y,^^"*) (i + Am/)-iT^0(x,) (12) 

where is the elementwise vector product. If we define 
Pts = {L + Am/)"i(0^^^ Kl3ut), then the outgoing mes- 
sage can be expressed as fhts — ^Pts- In other words, 
given incoming messages expressed as linear combinations 
of feature mapped training samples from Xt, the outgo- 
ing message will likewise be a weighted linear combination 
of feature mapped training samples from Xg. Importantly, 
only ni mapped points will be used to express the outgoing 
message, regardless of the number of incoming messages 
or the number of points used to express each incoming mes- 
sage. Thus the complexity of message representation does 
not increase with BP iterations or degree of a node. 

Although we have identified the model parameters with 
specific edges (s, t), our approach extends straightfor- 
wardly to a templatized model, where parameters are 
shared across multiple edges (this setting is often natural 
in image processing, for instance). Empirical estimates of 
the parameters are computed on the pooled observations. 



date in ( 12 1 is polynomial in term of the number of training 



samples. Assuming a preprocessing step of cost 0(m'^) to 
compute the matrix inverses, the update for a single mes- 
sage costs 0(m^(i,„ax) where dmux is the maximum degree 
of a node in the MRF. While this is reasonable in compari- 
son with competing nonparametric approaches (see Section 
|6]and the experiments), and works well for smaller graphs 
and trees, a polynomial time update can be costly for very 
large m, and for graphical models with loops (where many 
iterations of the message updates are needed). In Section 
|5] we develop a message approximation strategy which re- 
duces this cost substantially. 

5 Constant Time Message Updates 

In this section, we formulate a more computationally ef- 
ficient alternative to the full rank update in (12i. Our 



goal is to limit the computational cost of each update to 
0(£^(imax) where £ m. We will require a one-off pre- 
processing step which is linear in to. This efficient message 
passing procedure makes kernel BP practical even for very 
large graphical models and/or training set sizes. 

5.1 Approximating Feature Matrices 

The key idea of the preprocessing step is to approxi- 
mate messages in the RKHS with a few informative ba- 
sis functions, and to estimate these basis functions in a 
data dependent way. This is achieved by approximating 
the feature matrix $ as a weighted combination of a sub- 
set of its columns. That is, $ w ^xWt, where I := 
{ii, . . . C {1, . . . , m}, Wt has dimension £ x to, and 
— (^(^^i^ ), • ■ • , 'Pi^Y)) is a submatrix formed by tak- 
ing the columns of $ corresponding to the indices in I. 
Likewise, we approximate T w T jWg, assuming {Jl ~ £ 
for simplicity. We thus can approximate the kernel matrices 
as low rank factorizations, i.e., K « Wj KixWt and L = 
WjLjjWs, where Ku := and Ljj = TJTj. 

A common way to obtain the approximation $ w ^xWt 
is via a Gram-Schmidt orthogonalization procedure in fea- 
ture space, where an incomplete set of £ orthonormal ba- 
sis vectors Q := {q}, . . . ,ql) is constructed from a greed- 
ily selected subset of the data, chosen to minimize the 
reconstruction error ( |Shawe-Taylor & Cristianim} |2004| 
p. 126). The original feature matrix can be approximately 
expressed using this basis subset as <i> w QR where R G 
jj^xm ^j.g coefficients under the new basis. There is 
a simple relation between Q and the chosen data points 
$x, i-e., Q = where Ri is the submatrix formed 

by taking the columns of R corresponding to I. It fol- 
lows that Wt = Rj^R. All operations involved in Gram- 
Schmidt orthogonalization are linear in feature space, and 
the entries of R can be computed based solely on kernel 
values k{xt,Xt). The cost of performing this orthogonal- 
ization is 0{m£'^). The number I of chosen basis vectors 
is inversely related to the approximation error or residual 
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e — max,; ||(/)(a;t) — ^iVF/ ||jr (W^ denotes column i of s is a weighted Hnear combination of the £ vectors in T j. 



Wt). In many cases of interest (for instance, when a Gaus- 
sian RBF kernel is used), a small £ m will be sufficient 
to obtain a small residual e for the feature matrix, due to 



the fast decay of the eigenspectrum in feature space ( Bach 
& Jordan! |2^ , Appendix C). 



5.2 Approximating Tensor Features 

The approximations $ w ^iWt and T « T jWg, and as- 
sociated low rank kernel approximations are insufficient for 
a constant time approximate algorithm, however In fact, 
directly applying these results will only lead to a linear time 
approximate algorithm: this can be seen by replacing the 
kernel matrices in ( [T2] i by their low rank approximations. 

To achieve a constant approximate update, our strategy is 
to go a step further: in addition to approximating the kernel 
matrices, we further approximate the tensor product feature 
matrix in equation $^ « ^f,Wf {Wf G M^'^™)- 
Crucially, the individual kernel matrix approximations ne- 
glect to account for the subsequent tensor product of these 
messages. By contrast, our proposed approach also approx- 
imates the tensor product directly. The computational ad- 
vantage of a direct tensor approximation approach is sub- 
stantial in practice (a comparison between exact kernel BP 
and its constant and linear time approximations can be 
found in Section 3 of the Appendix) . 

The decomposition procedure for tensor <I>® « <^'^,Wf' 
follows exactly the same steps as in the original feature 
space, but using the kernel k'^*~^{xt, x'^), and yielding an 
incomplete orthonormal basis in the tensor product space. 
In general the index sets I' 7^ I, meaning they select dif- 
ferent training points to construct the basis functions. Fur- 
thermore, the size £' of I' is not equal to the size ^ of I 
for a given approximation error e. Typically £' > £, since 
the tensor product space has a slower decaying spectrum, 
however we will write £ in place of £' to simplify notation. 

5.3 Constant Time Approximate Updates 

We now compute the message updates based on the various 
low rank approximations. The incoming messages and the 
conditional embedding operators become 

(g) , TOnt « (g) , ^lWtl3ut, (13) 

u\s u\s 

^xf|x>(2^.)~^l'W^tsT5</>(x3), (14) 

where Wts := W^iWjLjjWs + XmI)--'^Wj . If we 
reparametrize the messages m„t as m„t = ^xctut where 
c^ut '■= VVtPut, we can express the message updates for 

nitsixs) as 



O ^ feant ) WstT]^cb{xs), (15) 



where Kx'i denotes the submatrix of K with rows in- 
dexed I' and columns indexed I. The outgoing mes- 
sage mts can also be reparametrized as a vector ats ~ 
Wjt (^(z)u\s ^i'io:utj ■ In short, the message from t to 



We note that Wts can be computed efficiently prior 
to the message update step, since W^{Wj LjjWs + 
\mI)-^Wj = Wf>Wj{WsWj + \mL-\)-^L-^j via 
the Woodbury expansion of the matrix inverse. In the latter 
form, matrix products M^sT4^7 and WfWj cost 0{£^m); 
the remaining operations (size £ matrix products and inver- 
sions) are significantly less costly at OiJ?). This initiaUza- 
tion cost of 0{£? + £?rn) need only be borne once. 



The cost of updating a single message mts in < 15 1 be- 
comes 0(^^(iniax) where dmax is the maximum degree of 
a node. This also means that our approximate message up- 
date scheme will be independent of the number of training 
examples. With these approximate messages, the evalua- 
tion of the belief B(a;r) of a node r at can be carried out 
in time 0(£dmax)- 

Finally, approximating the tensor features introduces ad- 
ditional error into each message update. This is caused 
by the difference between the full rank conditional em 
bedding operator iA^% v in 

terpart 



^ ill (jTT|) and its low rank coun- 
in Under suitable conditions, this 



difference is boundedoy the feature approximation error 



e. I.e., llZ^xf |x, ~'^xf\xJ\HS 
Section 8 of the Appendix for details). 

6 Gaussian Mixture And Particle BP 

We briefly review two state-of-the-art approaches to non- 



< 2e(A-i + A-3/2) (see 



parametric belief propagation: Gaussian Mixture BP ( Sud- 
'dert h et al.| [2003] ) and Particle BP ( Diler & M cAllester| 
,2009 1. By contrast with our approach, we must provide 



these algorithms in advance with an estimate of the condi- 
tional density ¥* {Xt\Xs), to compute the conditional ex- 
pectation in Q. For Gaussian Mixture BP, this conditional 
density must take the form of a mixture of Gaussians. We 
describe how we learn the conditional density from data, 
and then show how the two algorithms use it for inference. 

A direct approach to estimating the conditional density 
P*{Xt\Xs) would be to take the ratio of the joint empir- 
ical density to the marginal empirical density. The ratio 
of mixtures of Gaussians is not itself a mixture of Gaus- 
sians, however, so this approach is not suitable for Gaus- 
sian Mixture BP (indeed, message updates using this ratio 
of mixtures would be non-trivial, and we are not aware of 
any such inference approach). We propose instead to learn 
■^*{Xt\Xs) directly from training data following Sugiyama 



et al.| ( [20T0,) , who provide an estimate in the form of a mix 
ture of Gaussians (see Section 1 of the Appendix for de 
tails). We emphasize that the updates bear no resemblance 



to our kernel updates in ( 12 1, which do not attempt density 
ratio estimation. 

Given the estimated P(Xt|Xs) as input, each nonparamet- 
ric inference method takes a different approach. Gaussian 
mixture BP assumes incoming messages to be a mixture of 
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Figure 1: Average denoising error and runtime of kernel BP com- 
pared to discrete, Gaussian mixture and particle BP over 10 test 
images with varying numbers of rings. Runtimes are plotted on a 
logarithmic scale. 

b Gaussians. The product of dt incoming messages to node 
t then contains Gaussians. This exponential blow-up 
is avoided by replacing the exact update with an approxi- 
mation. An overview of approximation approaches can be 
found in |Bickson etal.| ( 201 1 j ; we used an efficient KD-tree 
method of 



Diler et al 



2003| l for performing the approxi- 



mation step. Particle BP represents the incoming messages 
using a common set of particles. These paiticles must be 
re-drawn via Metropolis-Hastings at each node and BP it- 
eration, which is costly (although in practice, it is sufficient 
to resample periodically, rather than strictly at every iter- 
ation). By contrast, our updates are simply matrix-vector 
products. See Appendix for further discussion. 

7 Experiments 

We performed four sets of experiments. The first two were 
image denoising and depth prediction problems, where we 
show that kernel BP is superior to discrete, Gaussian mix- 
ture and particle BP in both speed and accuracy, using a 



GraphLab implementation of each (Low et al. 2010 1. The 
remaining two experiments were protein structure and pa- 
per category prediction problems, where domain-specific 
kernels were crucial (for the latter see Appendix, Sec. 4). 

Image denoising: In our first experiment, the data con- 
sisted of grayscale images of size 100 x 100, resembling a 
sunset (Figure [TJ a)). The number of colors (gray levels) in 
the images ranged across 10, 25, 50, 75, 100, 125, 150, 175, 
200, 225 and 250, with gray levels varying evenly from to 
255 from the innermost ring of the sunset to the outermost. 
As we increased the number of colors, the grayscale transi- 
tion became increasingly smooth. Our goal was to recover 
the original images from noisy versions, to which we had 
added zero mean Gaussian noise with a = 30. We com- 
pared the denoising performance and runtimes of discrete, 
Gaussian mixture, particle, and kernel BP. 

The topology of our graphical model was a grid of hid- 
den noise-free pixels with noisy observations made at each. 
The maximum degree of a node was 5 (four neighbours 
and an observation), and we used a template model where 



both the edge potentials and the likelihood functions were 
shared across all variables. We generated a pair of noise- 
free and noisy images as training data, at each color num- 
ber. For kernel BP, we learned both the likelihood function 
and the embedding operators nonparametrically from the 
data. We used a Gaussian RBF kernel k{x, x'), with ker- 
nel bandwidth set at the median distance between training 
points, and residual e = 10""^ as the stopping criterion for 
the feature approximation (see definition of e in Section 



5.1 1. For discrete, Gaussian mixture, and particle BP, we 
learned the edge potentials from data, but supplied the true 
likelihood of the observation given the hidden pixel (i.e., a 
Gaussian with standard deviation 30). This gave compet- 
ing methods an important a priori advantage over kernel 
BP: in spite of this, kernel BP still outperformed compet- 
ing approaches in speed and accuracy. 

In Figure [TJc) and (d), we report the average denoising 
performance (RMSE: root mean square eiTor) and runtime 
over 30 BP iterations, using 10 independently generated 
noisy test images. The RMSE of kernel BP is significantly 
lower than Gaussian mixture and particle BP for all num- 
bers of colors. Although the RMSE of discrete BP is about 
the same as kernel BP when the number of colors is small, 
its performance becomes worse than kernel BP as the num- 
ber of colors increases beyond 100 (despite discrete BP 
receiving the true observation likelihood in advance). In 
terms of speed, kernel BP has a considerable advantage 
over the alternatives: the runtime of KBP is barely affected 
by the number of colors. For discrete BP, the scaling is ap- 
proximately square in the number of colors. For Gaussian 
mixture and particle BP, the runtimes are orders of magni- 
tude longer than kernel BP, and are affected by the variabil- 
ity of the resampling algorithm. 

Predicting depth from 2D images: The prediction of 3D 
depth information from 2D image features is a difficult in- 
ference problem, as the depth may be ambiguous: sim- 
ilar features can occur at different depths. This creates 
a multimodal depth distribution given the image feature. 
Furthermore, the marginal distribution of the depth can it- 
self be multimodal, which makes the Gaussian approxima- 
tion a poor choice (see Figure |2|b)). To make a spatially 
consistent prediction of the depth map, we formulated the 
problem as an undirected graphical model, where a depth 
variable j/j G M was associated with each patch of an im- 
age, and these variables were connected according to a 2D 
grid topology. Each hidden depth variable was linked to 
an image feature variable Xi £ M^^-^ for the correspond- 
ing patch. This formulation resulted in a graphical model 
with 9, 202 = 107 x 86 continuous depth variables, and 
a maximum node degree of 5. Due to the way the images 
were taken (upright), we used a templatized model where 
horizontal edges in a row shared the same potential, ver- 
tical edges at the same height shared the same potential, 
and patches at the same row shared the same likelihood 



Kernel Belief Propagation 




H histogram 
■ — gaussian 
— nonparametric 

i.< ■/ML.ki 











results of pointwise MRF reported in Saxena et al. ( 2009[ l. 
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Figure 2: Average depth prediction error and runtime of kernel 
BP compared to discrete, Gaussian mixture and particle BP over 
274 images. Runtimes are on a logarithmic scale. 
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Figure 3: Average angle prediction accuracy of kernel versus 
particle BP in the protein folding problem. 

function. Both the edge potentials between adjacent depth 
variables and the likelihood function between image fea- 
ture and depth were unknown, and were learned from data. 

We used a set of 274 images taken on the Stanford campus, 



including both indoor and outdoor scenes (Saxena et al. 
|2009[ ). Images were divided into patches of size 107 by 86, 
with the corresponding depth map for each patch obtained 
using 3D laser scanners (e.g., Figure|2|a)). Each patch was 
represented by a 273 dimensional feature vector, which 
contained both local features (such as color and texture) 
and relative features (features from adjacent patches). We 
took the logarithm of the depth map and performed learning 
and prediction in this space. The entire dataset contained 
more than 2 milUon data points (107 x 86 x 274). We 
applied a Gaussian RBF kernel on the depth information, 
with the bandwidth parameter set to the median distance 
between training depths, and an approximation residual of 
e = 10^"^. We used a linear kernel for the image features. 

Our results were obtained by leave-one-out cross valida- 
tion. For each test image, we ran discrete, Gaussian mix- 
ture, particle, and kernel BP for 10 BP iterations. The aver- 
age prediction error (MAE: mean absolute error) and run- 
time are shown in Figures |2|c) and (d). Kernel BP pro- 
duces the lowest error (MAE=0. 145) by a significant mar- 
gin, while having a similar runtime to discrete BP. Gaussian 
mixture and particle BP achieve better MAE than discrete 
BP, but their runtimes are two order of magnitude slower. 
We note that the error of kernel BP is shghtly better than the 



Protein structure prediction: Our final experiment inves- 
tigates the protein folding problem. The folded configu- 
ration of a protein of length n is roughly determined by a 
sequence of angle pairs {{6.i,u!i)}2^^, each specific to an 
amino acid position. The goal is to predict the sequence 
of angle pairs given only the amino acid sequence. The 
two angles {0i,uji) have ranges [0,180] and (-180,180] 
respectively, such that they correspond to points on the unit 
sphere S^. Kernels yield an immediate solution to infer- 



ence on these data: [Wendland (2005 Theorem 17.10) pro- 
vides a sufficient condition for a function on to be posi- 
tive definite, satisfied by k{x, x') := exp(o' (x, x')), where 
(x, x') is the standard inner product between Euclidean co- 
ordinates. Given the data are continuous, multimodal, and 
on a non-Euclidean domain (Figure [3j a)), it is not obvious 
how Gaussian mixture or discrete BP might be applied. We 
therefore focus on comparing kernel and particle BP. 

We obtained a collection of 1 , 400 proteins with length 
larger than 100 from PDB. We first ran PSI-BLAST to gen- 
erate the sequence profile (a 20 dimensional feature for 
each amino acid position), and then used this profile as 
features for predicting the folding structure ( |Jones| |1999) l. 
The graphical model was a chain of connected angle pairs, 
where each angle pair was associated with a 20 dimen- 
sional feature. We used a linear kernel on the sequence fea- 
tures. For the kernel between angles, the bandwidth param- 
eter was set at the median inner product between training 
points, and we used the approximation residual e = 10^'^. 
For particle BP, we learned the nonparametric potentials 
using exp(CT (x, x')) as the basis functions. 

In Figure |3jb), we report the average prediction accuracy 
(Mean Cosine Similarity between the true coordinate x 
and the predicted x' , i.e., {x,x')) over a 10-fold cross- 
validation process. In this case, kernel BP achieves a sig- 
nificantly better result than particle BP while running much 
faster (runtimes not shown due to space constraints). 

8 Conclusions and Further Work 

We have introduced kernel belief propagation, where the 
messages are functions in an RKHS. Kernel BP performs 
learning and inference on challenging graphical models 
with structured and continuous random vaiiables, and is 
more accurate and much faster than earlier nonparametric 
BP algorithms. A possible extension to this work would 
be to kernelize tree-reweighted belief propagation ( WaiiiJ 
|wright et al.||2003) . The convergence of kernel BP is a fur- 
ther challenging topic for future work (Ihler et al. 2005| l. 
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Supplementary to Kernel Belief Propagation 

Section [T] contains a review of Gaussian mixture BP and particle BP, as well as a detailed explanation of our strategy for 
learning edge potentials for these approaches from training data. Section|2] provides parameter settings and experiment 
details for particle BP and discrete BP, in the synthetic image denoising and depth reconstruction experiments. Section 
[3] contains a comparison of two different approximate feature sets: low rank approximation of the tensor features and 
low rank approximation of the individual features alone. Section]?] is an experiment on learning paper categories using 
citation networks. Sections]5]and|6]demonstrate the optimization objective of locally consistent BP updates, and provide a 
derivation of these updates in terms of the conditional expectation. Section]?] discusses the kemelization of Gaussian BP. 
Section ]8] gives the error introduced by low rank approximation of the messages. 



1 Gaussian Mixture and Particle BP 



We describe two competing approaches for nonparametric belief propagation: Gaussian mixture BP, originally known as 
non-parametric BP ( Sudderth et aL] |2003| l, and particle BP (Ihler & McAllester, 2009) . For these algorithms, the edge 
potentials 'i'{xs,xt), self-potentials tp{xt), and evidence potentials ^{xt,yt) niust be provided in advance by the user 
Thus, we begin by describing how the edge potentials in Section 2 of the main document may be learned from training 
data, but in a form applicable to these inference algorithms: we express P{xt\xs) as a mixture of Gaussians. We then 
describe the inference algorithms themselves. 

In learning the edge potentials, we turn to |Sugiyama et al.| ( |2010[ ), who provide a least-squares estimate of a conditional 
density in the form of a mixture of Gaussians, 

b 

V{v\u) = ^ aiKi{u, v) = a^Ku,v, 

i=l 

where Ki{u, v) is a Gaussian with diagonal covariance centred a|^ {qi,ri). Given a training set {(u-', f^)}™^]^, we obtain 
the coefficients 

-1 



(i? + A/) 



h 



where H := X^JLi Iv '^u^.v^-Zj v'^^^ ^ '■~ Sj=i '^ui,vU A is a regularization coefficient, and [a]+ sets all negative entries 
of its argument to zero (the integral in H can easily be computed in closed form). We emphasize that the Gaussian mixture 
representation takes a quite different form to the RKHS representation of the edge potentials. Finally, to introduce evidence, 
we propose to use kernel ridge regression to provide a mean value of the hidden variable xt given the observation yt, and 
to center a Gaussian at this value: again, the regression function is learned nonparametrically from training data. 

We now describe how these edge potentials are incorporated into Gaussian mixture BP. Assuming the incoming messages 
are each a mixture of b Gaussians, the product of dt such messages will contain fe'^* Gaussians, which causes an exponential 
blow-up in representation size and computational cost. In their original work, Sudderth et al. address this issue using an 
approximation scheme. First, they subsample from the incoming mixture of ft'^' Gaussians to draw b Gaussians, at a 
computational cost of 0{dtTb^) for each node, where r is the number of iterations of the associated Gibbs sampler (see 
their Algorithm 1). The evidence introduced via kernel ridge regression is then incorporated, using a reweighting described 
by their Algorithm 2. Finally, in Algorithm 3, b samples {zj} . are drawn from the reweighted mixture of b Gaussians, 

and for each of these, {x*}^ .^^ are drawn from the conditional distribution Xs\xl arising from the edge potential iIj{xs, Xt) 
(which is itself a Gaussian mixture, learned via the approach of Sugiyama et al.). Gaussians are placed on each of the 
centres {xl}^_^, and the process is iterated. 



In our implementation, we used the more efficient multiscale KD-tree sampling method of Ihler et al. ( |2003| l. We converted 



the Matlab Mex implementation of Ihler (2003) to C-H-, and used GraphLab to execute sampling in parallel with up to 16 
cores. An input parameter to the sampling procedure is e, the level of accuracy. We performed a line search to set e for 
high accuracy, but limited the execution time to be at most 1000 times slower than KBP. 

Finally, we describe the inference procedure performed by Particle BP. In this case, each node t is associated with a set 
of particles {xj}^ drawn i.i.d. from a distribution Wt{xt)- Incoming messages m„j are expressed as weights of the 
particles xj. Unlike Gaussian mixture BP, the incoming messages all share the same set of particles, which removes 



'These centres may be selected at random from the training observations. We denote the mixture kernel by k{u, v) to distinguish it 
from the RKHS kernels used earlier. 
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the need for Parzen window smoothing. The outgoing message nits is computed by summing over the product of these 
weights and the edge and evidence potentials at the particles, yielding a set of weights over samples {a:^s } at node s; the 
procedure is then iterated (see ,Ihler & McAllester 2009 eq. 8). We again implement this algorithm using edge potentials 
computed according to Sugiyama et al. Since an appropriate sample distribution Wt is hard to specify in advance, a 
resampling procedure must be carried out at each BP iteration, to refresh the set of samples at each node and ensure the 
samples cover an appropriate support (this is a common requirement in particle filtering). Thus, each iteration of Particle 
BP requires a Metropolis-Hastings chain to be run for every node, which incurs a substantial computational cost. That 
said, we found that in practice, the resampling could be conducted less often without an adverse impact on performance, 
but resulting in major improvements in runtime, as described in Section |2]below. See ( Ihler & McAllester| |2009| Section 
6) for more detail. 



2 Settings for Discrete and Particle BP 
2.1 Depth Reconstruction from 2-D Images 
2.1.1 Discrete BP 



The log-depth was discretized into 30 bins, and edge parameters were selected to achieve locally consistent Loopy BP 
marginals using the technique described in IWainwright et al. (2003 1. Empirically, finer discretizations did not improve 
resultant accuracy, but increased runtime significantly. We used the Splash scheduling of Gonzalez et al. ( 2009| l since it 
provided the lowest runtime among all tested schedulings. 



2.1.2 Particle BP 



The particle belief propagation implementation was particularly difficult to tune due to its excessively high runtime. Theo- 
retically, results comparable to the Kernel BP method were attainable. However in practice, the extremely high cost of the 
resampling phase on large models meant that only a small number of particles could be maintained if a reasonable runtime 
was to be achieved on our evaluation set of 274 images. 

Ultimately, we decided to find a configuration which allowed us to complete the evaluation in about 2 machine-days on an 
8-core Intel Nehalem machine; allowing inference on each evaluation image to take 10 minutes of parallel computation. 
For each image, we ran 100 iterations of a simple linear-sweep scheduling, using 20 particles per message, and resampling 
every 10 iterations. Each resampling phase ran MCMC for a maximum of 10 steps per particle. We also implemented 
acceleration tricks where low weight particles (< IE — 7 after normalization) were ignored during the message passing 
process. Empirically this decreased runtime without affecting the quality of results. 



2.2 Synthetic Image Denoising 
2.2.1 Discrete BP 



To simplify evaluation, we permitted a certain degree of "oracle" information, by matching the discretization levels during 
inference with the color levels in the ground-truth image. 

We evaluated combined gradient/IPF + BP methods here to learn the edge parameters. We found that gradient/IPF per- 
formed well when there were few colors in the image, but failed to converge when the number of colors increased into the 
hundreds. This is partly due to the instability of BP, as well as the large number of free parameters in the edge potential. 

Therefore once again, edge potentials were selected using the technique described in |Wainwright et al.| ( |2003| l. This 
performed quite well empirically, as seen in Figure 1(c) (main document). 



2.2.2 Particle BP 



The high runtime of the Particle Belief Propagation again made accuracy evaluation difficult. As before, we tuned the 
particle BP parameters to complete inference on the evaluation set of 110 images in 2 machine days, allowing about 25 
minutes per evaluation image. We ran 100 iterations of 30 particles per message, resampling every 15 iterations. Each 
resampling phase ran MCMC for a maximum of 10 steps per particle. 
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3 Effects of Approximate Message Updates 

In this section, we study how different levels of feature approximation error e affect the speed of kernel BP and the 
resulting performance. Our experimental setup was the image denoising experiment described in Section 5.1 of the main 
document. We note that the computational cost of our constant message update is ©(^'^dmax) where ^ is inversely related 
to the approximation error e. This is a substantial runtime improvement over naively applying a low rank kernel matrix 
approximation, which only results in a linear time update with computational cost 0(£TOd,nax)- In this experiment, we 
varied the feature approximation error e over three levels, i.e. 10^^, 10^^, 10^"^, and compared both speed and denoising 
performance of the constant time update to the linear time update. 

From Figures|4](a) and (c), we can see that for each approximation level, the constant time update achieves about the same 
denoising performance as the linear time update, while at the same time being orders of magnitude faster (by comparing 
Figures |4](b) and (d)). Despite the fact that the constant time update algorithm makes an additional approximation to the 
tensor product features, its denoising performance is not affected. We hypothesize that the degradation in performance is 
largely caused by representing the messages in terms of a small number of basis functions, while the approximation to the 
tensor features introduces little additional degradation. 

Another interesting observation from Figure [4] (d) is that the runtime of constant time kernel BP update increases as the 
number of colors in the image increases. This is mainly due to the increased number of test points as the color number 
increases; and also partially due to the increased rank needed for approximating the tensor features. In Figure|5] we plot 
the rank needed for kernel feature approximation and tensor feature approximation for different numbers of colors and 
different approximation errors e. It can be seen that in general, as we use a smaller approximation error, the rank increases, 
leading to a slight increase in runtime. 

Finally, we compare with kernel belief propagation in the absence of any low rank approximation (KBP Full). Since KBP 
Full is computationally expensive, we reduced the denoising problem to images of size 50 x 50 to allow KBP to finish in 
reasonable time. We only compared on 100 color images, again for reasons of cost. We varied the feature approximation 
error for the constant time and linear time approximation over three levels, 10~^, 10~^, 10^'^, and compared both speed 
and denoising performance of KBP Full versus the constant time and linear time updates. 

The comparisons are shown in Figure |6] We can see from Figure |6|a) that the denoising errors for constant time and 
linear time approximations decrease as we decrease the approximation error e. Although the denoising error of KBP Full 
is slightly lower than constant time approximations, it is a slight increase over the linear time approximation at e = 10^'^. 
One reason might be that the kernel approximation also serves as a means of regularization when learning the conditional 
embedding operator. This additional regularization may have slightly improved the generalization ability of the linear time 
approximation scheme. In terms of runtime (Figure |6|b)), constant time approximation is substantially faster than linear 
time approximation and KBP Full. In particular, it is nearly 100 times faster than the linear time algorithm, and 10000 
times faster than KBP Full. 



4 Predicting Paper Categories 

In this experiment, we predict paper categories from a combination of the paper features and their citation network. Our 
data were obtained by crawling 143,086 paper abstracts from the ACM digital library, and extracting the citation networks 
linking these papers. Each paper was labeled with a variable number of categories, ranging from 1 to 10; there were a total 
of 367 distinct categories in our dataset. For simplicity, we ignored directions in the citation network, and treated it as an 
undirected graph (i.e, we did not distinguish "citing" and "being cited"). The citation network was sparse, with more than 
85% of the papers having fewer than 10 links. The maximum number of links was 450. 

Paper category prediction is a multi-label problem with a large output space. The output is very sparse: the label vectors 
have only a small number of nonzero entries. In this case, the simple one-against-all approach of learning a single predictor 



for each category can become prohibitively expensive, both in training and in testing. Recently, Hsu et al. (2009) proposed 
to solve this problem using compressed sensing techniques: high dimensional sparse category labels are first compressed 
to lower dimensional real vectors using a random projection, and regressors are learned for these real vectors. In the testing 
stage, high dimensional category labels are decoded from the predicted real vectors of the test data using orthogonal 
marching pursuit (OMP). 

For the purposed of the present task, however, the compressed sensing approach ignores information from the citation net- 
work: papers that share categories tend to cite each other more often. Intuitively, taking into account category information 
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Figure 4: Average denoising error and runtime of linear time kernel BP versus constant time kernel BP, using different 
feature approximation errors, over 10 test images with a varying number of image colors. 



from neighboring papers in the citation network should improve the performance of category prediction. This intuition 
can be formalized using undirected graphical models: each paper i contains a category variable yi € {0, 1}^^'^, and these 
variables are connected according to the citation network; each category variable is also connected to a variable Xi corre- 
sponding to the abstract of the paper. In our experiment, we used 9700 stem words for the abstracts, and Xi € M^^"'^ was 
the tf-idf vector for paper i. The graphical model thus contains two types of edge potential, '^{yi, yi) and ^{yi,yk), where 
k G is the neighbor of j according to the citation network. 

It is difficult to learn this graphical model and perform inference on it, since the category variables yi have high cardinality, 
making the marginalization step in BP prohibitively expensive. Inspired by the compressed sensing approach for multilabel 
prediction, we first employed random projection kernels, and then used our kernel BP algorithm. Let A G R<ix367 
a random matrix containing i.i.d. Gaussian random variables of zero mean and variance 1/d. We defined the random 
projection kernel for the category labels to be k{y,y') = {Ay, Ay') — {(j){y), (f>{y')), and used a linear kernel for the 
abstract variables. We ran kernel BP for 5 iterations, since further iterations did not improve the performance. MAP 
assignment based on the belief was performed by finding a unit vector = Ay that maximized the beUef. The sparse 
category labels y were decoded from (j){y) using OMP. 

To measure experimental performance, we performed 10 random splits of the papers, where in each split we used 1/3 of 
the papers for training and the remaining 2/3 for testing. The random splits were controlled in such a way that high degree 
nodes (with degree > 10) in the citation networks always appeared in the training set. Such splitting reflects the data 
properties expected in real-world scenarios: important papers (high degree nodes which indicate either influential papers 
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Figure 5: The rank obtained for kernel feature approximation and tensor feature approximation for different levels of 
approximation error e, for the image denoising problem. 
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Figure 6: (a) Denoising error for the constant time approximate update (KBP) and linear time approximate update (KB- 
PLinear), over three levels of approximation error e — {10^^, 10^^, 10^"^}, versus kernel BP without low rank approxi- 
mation (Full), (b) Runtime corresponding to different approximation schemes. 



or survey papers) are usually labeled, whereas the majority of papers may not have a label; the automatic labeling is mainly 
needed for these less prominent papers. We used recall® /c in evaluating the performance of our method on the test data. 
We compared against the regression technique of Hsu et al. for multilabel prediction, and a baseline prediction using the 
top k most frequent categories. For both our method and the method of Hsu et al., we used a random projection matrix 
with d = 100. 

Results are shown in Figure [7] Kernel BP performs better than multilabel prediction via compressed sensing (i.e., the 
independent regression approach, which ignores graphical model structure) over a range of k values. In particular, for the 
top 10 and 20 predicted categories, kernel BP achieves recall scores of 0.419 and 0.476, respectively, as opposed to 0.362 
and 0.417 for independent regression. 

5 Local Marginal Consistency Condition When Learning With BP 



In this section, we show that fixed points of BP satisfy particular marginal consistency conditions ( 29 1 and ( 30 1 below. As 



we will see, these arise from the fact that we are using a Bethe free energy approximation in fitting our model, and the form 
of the fixed point equations that define the minimum of the Bethe free energy. The material in this section draws from a 



number of references (for instance |Yedidia et al. 2001 2005 Wain wright & Jordan 2008 KoUer & Friedman| |2009| l, but 
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Figure 7: Comparison of kernel BP, multilabel prediction via compressed sensing (regression), and a baseline prediction 
using the top k most frequent categories (mean) for ACM paper category prediction. 



is presented in a form specific to our case, since we are neither in a discrete domain nor using exponential families. 

The parameters of a pairwise Markov random field (MRF) can be learned by maximizing the log-likelihood of the model 
P with respect to true underlying distribution P*. Denote the model by 

(s,t)e£ sev 

where Z := Yl^s t)e£ ^st{Xs,Xt) Yisev ^s{Xs) is the partition function that normaUzes the distribution. The model 
parameters {'i>st{Xs, Xt), 'i's{Xs)} can be estimated by maximizing 

C = Exr^r*ix) [logP(X)] 



E 



X~P*(X) 



(16) 



\og-^st{Xs,Xt) + J2^og-^s{Xs)-\ogZ 

Define '^'stiXs,Xt) := log'i' stiXs, Xt) and $s(^s) := log*s(Xs). Setting the derivatives of £ with respect to 
I (^s , ) , $s ) } to zero, we have 

dC . dlogZ 



^*{Xs,Xt 



d-^st{Xs,Xt) 



0, 



(17) 



(18) 



d-^s{Xs) 

d^s{Xs) d^siXs) 
For a general pairwise MRF on a loopy graph, computing the log-partition function, log Z, is intractable. Following e.g. 
Yedidia et al. ( 2001 2005 1 and Koller & Friedmanl ( |2009| Ch. 11), log Z may be approximated as a minimum of the Bethe 
free energy with respect to a new set of parameters {6st, &s}, 

F{{bsuhs})^ I I h,t{Xs,Xt)[\oghst{Xs,Xt) -\og^ st{Xs,Xt)^s{Xs)^t{Xt)]dXsdXt 



(s,t)e£ 



sev 



b,{X,) [\ogbs{Xs)-log-^s{Xs)]dXs 



(19) 



X 



subject to normalization and marginalization constraints, bs{Xs)dXs = 1, bst{Xs, Xt)dXs ~ bt{Xt). Let F* := 
minjb^t.b^} F (we note that Bethe free energy is not convex, and hence there can be multiple local minima. Our reasoning 
does not require constructing a specific local minimum, and therefore we simply write F*). The zero gradient conditions 
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on the partial derivatives of £ are then approximated as 



Since F is a linear function of |^'st(Xs, Xt), for every fixed {bst,bs}, Danskin's theorem (Bertsekas 

717) gives us a way to compute the partial derivatives of F*. These are 

dF* _dFm„K}) 



1999 



(20) 
(21) 
P- 



d-^stiXs,Xt) d^st{Xs,Xt) 
dF* ^ dFmt,K}) 

d-^siXs) d'^iJsiXs) 



= hUXs.Xt), 
= KiXs), 



(22) 



(23) 



where b*} := argmin^fj^^ j F. Therefore, according to (20l and (21 1, learning a pairwise MRF using the Bethe 
energy variational approximation to the log partition function results in the following matching conditions, 

V*{X,,Xt)^b*,iX,,Xt), (24) 
P*{Xs)^b:iX,). (25) 
We now introduce the notion of belief propagation as a means of finding the minima of the Bethe free energy. This will in 
turn lead to local marginal consistency conditions for learning with BP. Yedidia et al. ( 200 l| l showed that the fixed point of 
F (and therefore the global minimum b*}) must satisfy the relations 

bUXs,Xt)=a^stiX,,Xt)^,iX,)^SJt{Xt) n <.iXs) n <tiXt), (26) 
b^X,) = a^siXs) n <siXs), (27) 



where a denotes a normalization constant and {mj^} are the fixed point messages, 

mliXs)=a J^-^,t{Xs,Xt)^t{Xt) [] <t(^t) ^X*. 



(28) 



uert\s 



Thus, 



V*iX,,Xt)=a^stiXs,Xt)^,iX,)^t{Xt) II mUXs) J] <t{Xt), (29) 

tiers\t tiert\s 

P*(X,) = a^siXs) n KMs)- (30) 

Combining these relations and assuming that ¥*{Xs) and mjg(Xs) are strictly positive, we can also obtain the consistent 
relation for the local conditionals, 

-*{X.„Xt) _ ^stiXs,Xt)^,{Xt)Uuer,\s^utiXt) 



f*{Xt\X,) 



'*ix.) 



rnliX, 



(31) 



6 BP Inference Using Learned Potentials 

The inference problem in pairwise MRFs is to compute the marginals or the log partition function for the model with 
learned potentials. Belief propagation is an iterative algorithm for performing approximate inference in MRFs. BP can 
also be viewed as an iterative algorithm for minimizing the Bethe free energy approximation to the log partition function. 
The results of this algorithm are a set of beliefs which can be used for obtaining the MAP assignment of the corresponding 
variables. 

The BP message update (with the learned potentials) is 

mts{Xs)=aj ^,t{Xs,Xt)^t{Xt) J] rn^t{Xt) dXt, (32) 

ueTt\s 

and at any iteration, the beliefs can be computed using the current messages, 

M,t{X,,Xt)=a-^st{Xs,Xt)^>,{X,)-^t{Xt) J] mUXs) \{ m,t(Xt), (33) 

u£Ts\t veTt\s 

B,(X,) = a^siXs) n T^us{Xs). (34) 
To see how the message update equation can be expressed using the true local conditional V*{Xt\Xg), we divide both size 
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of 



32 1 by the fixed point message Tnfg{Xs) during BP learning stage, and introduce 1 = |^"^^'''° ,^T^(^xl) ■ message 



update equation in (|32|l can then be re-written as 



tiSl t\s 



The behef at any iteration becomes 



B,(X,) = a^siXs) n mUXs) = I [] '"fSy^ ) ( «*.(^.) H ) • (36) 

We reparameterize the message as 

m,t{Xt) i — — . (37) 

mstiXt) 

Since the potentials are learned via BP, we can use the relation in ( [3T] l to obtain 

mtsiX,)^ I ¥*{Xt\X,) \[ m^t{Xt)dXt. (38) 



Similarly, we obtain from ( 30 1 that 



,{Xs) = [ n in,t{Xs)\v*{X,). (39) 



Messages from Evidence Node Given evidence xt at node Xt, the outgoing message from Xt to Xs is mts{Xs) 
a'^ st{X s T Xt)"^ t{xt) ■ Using similar reasoning to the case of an internal node, we have 



mUXs) mUXs) 

_ ^st{Xs, xt)'ft{xt) Uuer,\s <t(^t) 1 



(40) 
(41) 



mt{Xs) I\uert\s^utixt) 

= "^*(^*'^^)n \^FJ^) ^^^^ 

o^V*{xt\Xs) (43) 

where IltigrAs "^uti^t) is constant given a fixed value Xt = xt. Reparametrizing the message nits {X,) ^ ^^f{g, the 
outgoing message from the evidence node is simply the true likelihood function evaluated at Xt- 

7 A Note on Kernelization of Gaussian BP 

In this section, we consider the problem of defining a joint Gaussian graphical model in the feature space induced by a 
kernel. We follow Bickson (2008 ) in our presentation of the original Gaussian BP setting. We will show that assuming a 
Gaussian in an infinite feature space leads to challenges in interpretation and estimation of the model. 

Consider a pairwise MRF, 

¥{X)^l['f,{Xs) n 'fst{Xs,Xt). (44) 
sev (s,t)e£ 
In the case of the Gaussian, the probability density function takes the form 

P(X) cx exp (- 1 (X - m)^ A(X - fi)) 
cx exp (- iX^ AX - b^X) , 

where A = C^^ is the precision matrix, and 

A/i = b. 



Putting this in the form ( |44| , the node and edge potentials are written 

1 

' 2 



^s{Xs) = exp (-IXjAssXs + bsX,) (45) 



and 

^stiXs, Xt) ^ exp {-XjA.tXt) . (46) 
We now consider how these operations would appear in Hilbert space. In this case, we would have 

^siXs) := exp (-i (<^(X,), + (6„ , 

and 

^stiX,,Xt) exp (- {(t>{Xs), • 
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We call the Ass and Ast precision operators, by analogy with the finite dimensional case. At this point, we already 
encounter a potential difficulty in kernelizing Gaussian BP: how do we learn the operators Ass, bs and Ast from data? We 
could in principle define a covaiiance operator C with (s, t)th block the pairwise covariance operator Cst, but Ast would 
then be the (s, t)th block of C^^, which is difficult to compute. As we shall see below, however, these operators appear in 
the BP message updates. 



Next, we describe a message passing procedure for the Gaussian potentials in (45 1 and (46 1. The message from t to s is 
written 

mts{Xs 



J^st{Xs,Xt)^t{Xt) Yl mut{Xt)dXt. 



(a) 

We first consider term (a) in the above. We will assume, with justification to follow, that m„t(Xt) takes the form 

mut{Xt) oc exp {-^XjPutXt + t^ZtXt) , 
where the terms P^t and are defined by recursions specified below (we retain linear algebraic notation for simplicity). 
It follows that 'i/t{Xt) YiueFiXs TT^ut{Xt) is proportional to a Gaussian, 

^t{Xt) n rn^tiXt)^ex_p(^-^XjPt\sXt+tiJ\sXt) , 
«ert\s 

where we define the intermediate operators 



and 



Pt\s ■■= As 



uert\s 
«ert\s 



To compute the message mts{Xs), we integrate 

nitsiXs) = J ^st{Xs,Xt)^t{Xt) n TT^ut{Xt)dXt 



X 



uert\s 



J exp i-XtAtsXs) exp (^-^Xj Pt^sXt + f^J\sXt) dXt 



X 

Completing the square, we get the parameters of the message rrits in the standard form, 

Pts — ~AjsPt\lAts 
fJ-ts = -^^J\sPt\lAts- 



(47) 

(48) 
(49) 



There are two main difficulties in implementing the above procedure in feature space. First, it is not clear how to learn 
the precision operators from the data. Second, we need to invert these precision operators. Thus, it remains a challenging 
open question to define Gaussian BP in feature space. The feature space Gaussian BP updates may be contrasted with the 
kernel BP updates we propose in the main text. The latter have regularized closed form empirical estimates, and they are 



very different from the Gaussian BP form in (47 1 and parameter updates in (48 1 and (49 1 



8 Message Error Incurred by the Additional Feature Approximation 



We bound the difference between the estimated conditional embedding operator U-^^ \x '^^'^ counterpart h(-x<s \x ^^^i" 
further feature approximation. Assume ||0(a;)||jp- < 1, and define the tensor feature ^(x) ^u\s ^i^)- Denote by ^(x) 
and the respective approximations of ^(x) and 0. Furthermore, let the approximation eiTor after the incomplete QR 



decomposition be e = max \ max x 



{x) - 4>{x) 



i{x)-i{x) j.it 



: follows that 



< 
< 

< 



Cxfx, i^x^x 
{Cx»x, ~ ^ 



HS 



+ \nl) ^ — Cx»X i^X^X^, + ^ml) ^ 



1 



A. 



X^J X^){Cx^X, + \nl) ^ 
1 



xfx. 



-C 



xfx. 



HS 



3/2 



a: 



HS 



HS 



HS 



■-X^Xs 



HS 



(50) 
(51) 
(52) 
(53) 
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For the first term, 



HS 



HS 
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HS 
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< max Uiximxy - e>:)<^(x:)T 
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Similarly, for the second term. 
Combining the results, we obtain 



HS 
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"372 



HS 



XgXg ^XgXg 



1 1 HS ^ 
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HS 
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